The MSR-Video to Text dataset with clean annotations
Authors
Abstract
Video captioning automatically generates short descriptions of video content, usually in the form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as a benchmark for testing the performance of these methods. However, we found that the human annotations, i.e., the descriptions of the video contents, are quite noisy: for example, there are many duplicate captions, and many captions contain grammatical problems. These problems may pose difficulties for models learning the underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that, when trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
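The abstract does not describe the cleaning procedure in detail. As a minimal sketch of only one step it names, removing duplicate captions, the Python below keeps the first occurrence of each caption within a video. The `(video_id, caption)` pair layout, the function names `normalize` and `drop_duplicate_captions`, and the case and whitespace normalization rule are all assumptions made for illustration; this is not the authors' pipeline, and it does not address the grammatical problems the abstract also mentions.

```python
from collections import defaultdict


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal.

    This normalization rule is an assumption, not taken from the paper.
    """
    return " ".join(caption.lower().split())


def drop_duplicate_captions(annotations):
    """Keep the first occurrence of each normalized caption within each video.

    `annotations` is a list of (video_id, caption) pairs; this layout is a
    hypothetical stand-in for the real annotation file format.
    """
    seen = defaultdict(set)  # video_id -> normalized captions already kept
    cleaned = []
    for video_id, caption in annotations:
        key = normalize(caption)
        # Skip empty captions and captions already kept for this video.
        if not key or key in seen[video_id]:
            continue
        seen[video_id].add(key)
        cleaned.append((video_id, caption))
    return cleaned
```

Duplicates are detected per video rather than globally, so the same sentence may still describe two different clips.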
Similar resources
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Supplementary Material
When organizing the Microsoft Research Video To Language challenge [1], we found that, in our previously released dataset [10], some sentences annotated by AMT workers are identical in one video clip or very similar in one category. Therefore, to control the quality of data and annotations, as well as the competitions, we removed those simple and duplicated sentences and replaced them with refi...
Reasoning with Text Annotations
With the emerging need for automation of business processes and the advent of semantic web it has become necessary that digital contents should be expressed not only in natural language, but also in a form that can be understood, interpreted and used by software agents, thus permitting them to find, share and integrate information more easily. Thus, Knowledge Representation and automated reason...
MSR-Asia at TREC-11 Video Track
The Media Computing Group of Microsoft Research Asia participated in all three tasks of the TREC-11 Video Track: automatic Shot Boundary Determination, Semantic Feature Extraction, and Video Search. A robust shot detector was proposed. Systems for semantic feature extraction and video retrieval, which integrate many of the group's recent research results, are presented.
Viewing Temporal Video Annotations
Video is a complex information space that requires advanced navigational aids for effective browsing. The increasing number of temporal video annotations offers new opportunities to provide video navigation according to a user's needs. We present a novel video browsing interface called TAV (Temporal Annotation Viewing) that provides the user with a visual overview of temporal video annotations....
Journal
Journal title: Computer Vision and Image Understanding
Year: 2022
ISSN: 1090-235X, 1077-3142
DOI: https://doi.org/10.1016/j.cviu.2022.103581